
    Depth Separation for Neural Networks

    Let $f:\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}\to\mathbb{R}$ be a function of the form $f(\mathbf{x},\mathbf{x}') = g(\langle\mathbf{x},\mathbf{x}'\rangle)$ for $g:[-1,1]\to \mathbb{R}$. We give a simple proof that poly-size depth-two neural networks with (exponentially) bounded weights cannot approximate $f$ whenever $g$ cannot be approximated by a low-degree polynomial. Moreover, for many $g$'s, such as $g(x)=\sin(\pi d^3 x)$, the number of neurons must be $2^{\Omega(d\log(d))}$. Furthermore, the result holds w.r.t. the uniform distribution on $\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}$. As many functions of the above form can be well approximated by poly-size depth-three networks with poly-bounded weights, this establishes a separation between depth-two and depth-three networks w.r.t. the uniform distribution on $\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}$.
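    A minimal numerical sketch of the separating target (toy sizes and helper names are assumptions; this only evaluates the function, it is not the approximation argument): sample pairs uniformly from the sphere and evaluate $f(\mathbf{x},\mathbf{x}')=\sin(\pi d^3\langle\mathbf{x},\mathbf{x}'\rangle)$.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Sample n points uniformly from the unit sphere S^{d-1}."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def hard_target(x, x_prime, d):
    """f(x, x') = g(<x, x'>) with g(t) = sin(pi * d^3 * t)."""
    t = np.sum(x * x_prime, axis=1)            # inner products in [-1, 1]
    return np.sin(np.pi * d**3 * t)

rng = np.random.default_rng(0)
d, n = 20, 5                                   # toy dimension and sample count (assumption)
x, x_prime = sample_sphere(n, d, rng), sample_sphere(n, d, rng)
print(hard_target(x, x_prime, d))
```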

    Complexity Theoretic Limitations on Learning Halfspaces

    We study the problem of agnostically learning halfspaces, which is defined by a fixed but unknown distribution $\mathcal{D}$ on $\mathbb{Q}^n\times \{\pm 1\}$. We define $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$ as the least error of a halfspace classifier for $\mathcal{D}$. A learner who can access $\mathcal{D}$ has to return a hypothesis whose error is small compared to $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$. Using the recently developed method of the author, Linial and Shalev-Shwartz, we prove hardness-of-learning results under a natural assumption on the complexity of refuting random $K$-$\mathrm{XOR}$ formulas. We show that no efficient learning algorithm has non-trivial worst-case performance even under the guarantees that $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) \le \eta$ for an arbitrarily small constant $\eta>0$, and that $\mathcal{D}$ is supported on $\{\pm 1\}^n\times \{\pm 1\}$. Namely, even under these favorable conditions its error must be $\ge \frac{1}{2}-\frac{1}{n^c}$ for every $c>0$. In particular, no efficient algorithm can achieve a constant approximation ratio. Under a stronger version of the assumption (where $K$ can be poly-logarithmic in $n$), we can take $\eta = 2^{-\log^{1-\nu}(n)}$ for arbitrarily small $\nu>0$. Interestingly, this is even stronger than the best known lower bounds (Arora et al. 1993, Feldman et al. 2006, Guruswami and Raghavendra 2006) for the case that the learner is restricted to return a halfspace classifier (i.e., proper learning).
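    As a concrete illustration of the quantity $\mathrm{Err}_{\mathrm{HALF}}$ being approximated, the sketch below estimates the 0-1 error of a fixed halfspace $x \mapsto \mathrm{sign}(\langle w, x\rangle)$ on samples from a distribution over $\{\pm 1\}^n\times\{\pm 1\}$. The data model, noise rate, and $w$ are arbitrary placeholders, not the hard distributions from the reduction.

```python
import numpy as np

def halfspace_error(w, X, y):
    """Empirical 0-1 error of the halfspace x -> sign(<w, x>)."""
    preds = np.sign(X @ w)
    preds[preds == 0] = 1                      # break ties arbitrarily
    return np.mean(preds != y)

rng = np.random.default_rng(1)
n, m = 50, 10_000                              # dimension and sample size (toy)
X = rng.choice([-1, 1], size=(m, n))           # features in {+-1}^n
w_star = rng.standard_normal(n)
noise = rng.random(m) < 0.05                   # flip ~5% of the labels (eta ~ 0.05)
y = np.where(noise, -np.sign(X @ w_star), np.sign(X @ w_star))
print("Err_HALF estimate for w_star:", halfspace_error(w_star, X, y))
```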

    Locally Private Learning without Interaction Requires Separation

    We consider learning under the constraint of local differential privacy (LDP). For many learning problems, known efficient algorithms in this model require many rounds of communication between the server and the clients holding the data points. Yet multi-round protocols are prohibitively slow in practice due to network latency and, as a result, currently deployed large-scale systems are limited to a single round. Despite significant research interest, very little is known about which learning problems can be solved by such non-interactive systems. The only lower bound we are aware of is for PAC learning an artificial class of functions with respect to a uniform distribution (Kasiviswanathan et al. 2011). We show that the margin complexity of a class of Boolean functions is a lower bound on the complexity of any non-interactive LDP algorithm for distribution-independent PAC learning of the class. In particular, the classes of linear separators and decision lists require an exponential number of samples to learn non-interactively, even though they can be learned in polynomial time by an interactive LDP algorithm. This gives the first example of a natural problem that is significantly harder to solve without interaction and also resolves an open problem of Kasiviswanathan et al. (2011). We complement this lower bound with a new efficient learning algorithm whose complexity is polynomial in the margin complexity of the class. Our algorithm is non-interactive on labeled samples but still needs interactive access to unlabeled samples. All of our results also apply to the statistical query model and to any model in which the number of bits communicated about each data point is constrained.
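    For intuition about the one-round constraint, here is a standard randomized-response sketch (a textbook local randomizer, not the paper's algorithm): each client sends a single privatized bit, and the server debiases the aggregate without any further interaction.

```python
import numpy as np

def randomize_bit(b, eps, rng):
    """eps-LDP randomized response for a single bit b in {0, 1}."""
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    return b if rng.random() < p_keep else 1 - b

def estimate_mean(reports, eps):
    """Debias the average of randomized bits to estimate the true mean."""
    p = np.exp(eps) / (np.exp(eps) + 1.0)
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(2)
eps, n = 1.0, 100_000                          # privacy budget and client count (toy)
bits = (rng.random(n) < 0.3).astype(int)       # true mean ~ 0.3
reports = [randomize_bit(b, eps, rng) for b in bits]
print("debiased estimate:", estimate_mean(reports, eps))
```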

    The price of bandit information in multiclass online classification

    We consider two scenarios of multiclass online learning of a hypothesis class $H\subseteq Y^X$. In the {\em full information} scenario, the learner is exposed to instances together with their labels. In the {\em bandit} scenario, the true label is not exposed, but rather an indication of whether the learner's prediction is correct or not. We show that the ratio between the error rates in the two scenarios is at most $8\cdot|Y|\cdot \log(|Y|)$ in the realizable case, and $\tilde{O}(\sqrt{|Y|})$ in the agnostic case. The results are tight up to a logarithmic factor and essentially answer an open question from (Daniely et al., Multiclass Learnability and the ERM Principle). We apply these results to the class of $\gamma$-margin multiclass linear classifiers in $\mathbb{R}^d$. We show that the bandit error rate of this class is $\tilde{\Theta}(\frac{|Y|}{\gamma^2})$ in the realizable case and $\tilde{\Theta}(\frac{1}{\gamma}\sqrt{|Y|T})$ in the agnostic case. This resolves an open question from (Kakade et al., Efficient Bandit Algorithms for Online Multiclass Prediction).
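    The toy loop below (illustrative only; the learner is a simple label-frequency tracker, not the algorithms analyzed in the paper) shows the difference between the two feedback models: in the full-information round the true label is revealed, while in the bandit round the learner only observes whether its guess was correct.

```python
import numpy as np

rng = np.random.default_rng(3)
K, T = 10, 20_000                               # number of labels and rounds (toy)
label_probs = rng.dirichlet(np.ones(K))         # fixed, unknown label distribution
labels = rng.choice(K, size=T, p=label_probs)

full_counts = np.zeros(K)                       # full information: sees y_t
bandit_counts = np.zeros(K)                     # bandit: only sees correct/incorrect
mistakes_full = mistakes_bandit = 0
explore = 0.1                                   # exploration rate (assumption)

for y in labels:
    # Full-information learner: predict the most frequent label seen so far.
    pred_f = int(np.argmax(full_counts))
    mistakes_full += int(pred_f != y)
    full_counts[y] += 1                         # the true label is revealed

    # Bandit learner: occasionally explore, learn only from "correct" feedback.
    pred_b = rng.integers(K) if rng.random() < explore else int(np.argmax(bandit_counts))
    correct = (pred_b == y)                     # the only feedback available
    mistakes_bandit += int(not correct)
    if correct:
        bandit_counts[pred_b] += 1

print("full-info mistakes:", mistakes_full, " bandit mistakes:", mistakes_bandit)
```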

    Tight products and Expansion

    In this paper we study a new product of graphs called the {\em tight product}. A graph $H$ is said to be a tight product of two (undirected multi-) graphs $G_1$ and $G_2$ if $V(H)=V(G_1)\times V(G_2)$ and both projection maps $V(H)\to V(G_1)$ and $V(H)\to V(G_2)$ are covering maps. It is not a priori clear when two given graphs have a tight product (in fact, it is $NP$-hard to decide). We investigate the conditions under which this is possible. This perspective yields a new characterization of class-1 $(2k+1)$-regular graphs. We also obtain a new model of random $d$-regular graphs whose second eigenvalue is almost surely at most $O(d^{3/4})$. This construction resembles random graph lifts, but requires fewer random bits.
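    A small sketch of the definition for simple graphs (toy example, not from the paper): it checks that both coordinate projections from a candidate $H$ on $V(G_1)\times V(G_2)$ are covering maps, i.e. the neighborhood of every vertex of $H$ projects bijectively onto the neighborhood of its image.

```python
def is_covering_projection(H_adj, G_adj, coord):
    """Check that projecting H onto coordinate `coord` is a covering map
    (simple graphs): the neighbors of every vertex of H must project
    bijectively onto the neighbors of that vertex's image in G."""
    for v, nbrs in H_adj.items():
        projected = sorted(w[coord] for w in nbrs)
        if projected != sorted(G_adj[v[coord]]):
            return False
    return True

# Toy example: G1 = G2 = K2 (a single edge a-b), and H a perfect matching
# on V(G1) x V(G2); both projections are coverings, so H is a tight product.
G = {"a": ["b"], "b": ["a"]}
H = {("a", "a"): [("b", "b")], ("b", "b"): [("a", "a")],
     ("a", "b"): [("b", "a")], ("b", "a"): [("a", "b")]}

print(is_covering_projection(H, G, 0) and is_covering_projection(H, G, 1))  # True
```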

    Competitive ratio versus regret minimization: achieving the best of both worlds

    We consider online algorithms under both the competitive ratio criterion and the regret minimization one. Our main goal is to build a unified methodology that can guarantee both criteria simultaneously. For a general class of online algorithms, namely any Metrical Task System (MTS), we show that one can simultaneously guarantee the best known competitive ratio and a natural regret bound. For the paging problem we further show an efficient online algorithm (polynomial in the number of pages) with this guarantee. To this end, we extend an existing regret minimization algorithm (specifically, that of Kapralov and Panigrahy) to handle movement cost (the cost of switching between states of the online system). We then show how to use the extended regret minimization algorithm to combine multiple online algorithms. Our end result is an online algorithm that can combine a "base" online algorithm, having a guaranteed competitive ratio, with a range of online algorithms that guarantee a small regret over any interval of time. The combined algorithm guarantees both that its competitive ratio matches that of the base algorithm and that it has low regret over any time interval. As a by-product, we obtain an expert algorithm with a close-to-optimal regret bound on every time interval, even in the presence of switching costs. This result is of independent interest.
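    To make the combination step concrete, here is a generic multiplicative-weights (Hedge) sketch that mixes several "expert" online algorithms and charges a cost whenever it switches between them. It is a standard textbook combiner with placeholder losses and parameters, not the extended Kapralov-Panigrahy algorithm from the paper.

```python
import numpy as np

def hedge_combine(loss_matrix, eta=0.1, switch_cost=1.0, rng=None):
    """Run Hedge over the columns (experts) of loss_matrix (T x N, losses in [0, 1]).
    Returns the combiner's total loss, including a charge for switching experts."""
    if rng is None:
        rng = np.random.default_rng(0)
    T, N = loss_matrix.shape
    weights = np.ones(N)
    total_loss, prev_choice = 0.0, None
    for t in range(T):
        probs = weights / weights.sum()
        choice = rng.choice(N, p=probs)           # follow one expert this round
        total_loss += loss_matrix[t, choice]
        if prev_choice is not None and choice != prev_choice:
            total_loss += switch_cost             # movement cost between "states"
        prev_choice = choice
        weights *= np.exp(-eta * loss_matrix[t])  # exponential weight update
    return total_loss

rng = np.random.default_rng(4)
losses = rng.random((1000, 3))                    # placeholder losses for 3 algorithms
print("combined loss (with switching cost):", hedge_combine(losses, rng=rng))
```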

    Complexity theoretic limitations on learning DNF's

    Using the recently developed framework of [Daniely et al., 2014], we show that, under a natural assumption on the complexity of refuting random K-SAT formulas, learning DNF formulas is hard. Furthermore, the same assumption implies the hardness of learning intersections of $\omega(\log(n))$ halfspaces, agnostically learning conjunctions, as well as virtually all (distribution-free) learning problems that were previously shown hard (under complexity assumptions).
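    For reference, the assumption concerns refuting random K-SAT; the sketch below merely generates such a random formula (uniformly random clauses over $n$ variables), which is the object the assumption refers to, not a refutation algorithm.

```python
import numpy as np

def random_ksat(n, m, k, rng):
    """Sample m clauses, each over k distinct variables with uniform random signs.
    A clause is a list of signed literals: +i means x_i, -i means NOT x_i."""
    formula = []
    for _ in range(m):
        vars_ = rng.choice(np.arange(1, n + 1), size=k, replace=False)
        signs = rng.choice([-1, 1], size=k)
        formula.append(list(signs * vars_))
    return formula

rng = np.random.default_rng(5)
print(random_ksat(n=10, m=5, k=3, rng=rng))
```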

    Optimal Learners for Multiclass Problems

    The fundamental theorem of statistical learning states that for binary classification problems, any Empirical Risk Minimization (ERM) learning rule has close to optimal sample complexity. In this paper we seek a generic optimal learner for multiclass prediction. We start by proving a surprising result: a generic optimal multiclass learner must be improper, namely, it must have the ability to output hypotheses which do not belong to the hypothesis class, even though it knows that all the labels are generated by some hypothesis from the class. In particular, no ERM learner is optimal. This brings us back to the fundamental question of "how to learn?" We give a complete answer to this question via a new analysis of the one-inclusion multiclass learner of Rubinstein et al. (2006), showing that its sample complexity is essentially optimal. Then, we turn to study the popular hypothesis class of generalized linear classifiers. We derive optimal learners that, unlike the one-inclusion algorithm, are computationally efficient. Furthermore, we show that the sample complexity of these learners is better than the sample complexity of the ERM rule, thus settling in the negative an open question due to Collins (2005).
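    As a point of reference for the ERM rule discussed above, here is a minimal ERM learner for a finite multiclass hypothesis class (a toy placeholder class, unrelated to the one-inclusion learner or the paper's constructions): it returns the hypothesis with the smallest empirical error and, by construction, never outputs anything outside the class (i.e., it is proper).

```python
def erm(hypotheses, X, y):
    """Return the hypothesis in the (finite) class with minimal empirical error."""
    def emp_error(h):
        return sum(h(x) != label for x, label in zip(X, y)) / len(y)
    return min(hypotheses, key=emp_error)

# Toy class: constant multiclass predictors over the labels {0, 1, 2}.
hypotheses = [lambda x, c=c: c for c in range(3)]
X = [0.1, 0.4, 0.9, 0.7, 0.3]
y = [2, 2, 1, 2, 0]
h_hat = erm(hypotheses, X, y)
print("ERM picks the constant predictor:", h_hat(0.0))   # label 2 (most frequent)
```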

    Memorizing Gaussians with no over-parameterization via gradient descent on neural networks

    We prove that a single step of gradient descent over a depth-two network with $q$ hidden neurons, starting from orthogonal initialization, can memorize $\Omega\left(\frac{dq}{\log^4(d)}\right)$ independent and randomly labeled Gaussians in $\mathbb{R}^d$. The result is valid for a large class of activation functions, which includes the absolute value.
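    The sketch below mirrors the setting numerically (a toy run, not the paper's proof or exact scaling): random labeled Gaussians, a depth-two network with absolute-value activation and orthogonal first-layer initialization, one full-batch gradient step on the first layer under squared loss, and a report of how many training labels are fit. Step size, the fixed random-sign second layer, and the sizes are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
d, q, m = 200, 50, 400                          # dimension, width, samples (toy)
X = rng.standard_normal((m, d))                 # random Gaussians in R^d
y = rng.choice([-1.0, 1.0], size=m)             # random labels

# Orthogonal first layer, fixed random-sign second layer.
W = np.linalg.qr(rng.standard_normal((d, q)))[0].T   # q x d, orthonormal rows
v = rng.choice([-1.0, 1.0], size=q) / np.sqrt(q)

def forward(W):
    return np.abs(X @ W.T) @ v                  # f(x) = sum_i v_i * |<w_i, x>|

# One full-batch gradient step on the squared loss, first layer only.
A = X @ W.T
residual = forward(W) - y                       # shape (m,)
grad_W = (2.0 / m) * ((residual[:, None] * np.sign(A) * v[None, :]).T @ X)
lr = 50.0                                       # step size chosen by hand (assumption)
W_new = W - lr * grad_W

acc = np.mean(np.sign(forward(W_new)) == y)
print("fraction of training labels fit after one step:", acc)
```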

    Neural Networks Learning and Memorization with (almost) no Over-Parameterization

    Many results in recent years established polynomial-time learnability of various models via neural network algorithms. However, unless the model is linearly separable or the activation is a polynomial, these results require very large networks -- much larger than what is needed for the mere existence of a good predictor. In this paper we prove that SGD on depth-two neural networks can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near-optimal network size, sample complexity, and runtime. In particular, we show that SGD on a depth-two network with $\tilde{O}\left(\frac{m}{d}\right)$ hidden neurons (and hence $\tilde{O}(m)$ parameters) can memorize $m$ randomly labeled points in $\mathbb{S}^{d-1}$.
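    A small companion sketch (a toy experiment under assumed hyperparameters, not the paper's algorithm or guarantees): plain minibatch SGD on a depth-two ReLU network with roughly $m/d$ hidden neurons, trained to fit $m$ randomly labeled points on the sphere, reporting the resulting training accuracy.

```python
import numpy as np

rng = np.random.default_rng(7)
d, m = 50, 1000
q = m // d                                      # roughly m/d hidden neurons
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the sphere S^{d-1}
y = rng.choice([-1.0, 1.0], size=m)             # random labels

W = rng.standard_normal((q, d)) / np.sqrt(d)
v = rng.choice([-1.0, 1.0], size=q)

def forward(Xb, W, v):
    return np.maximum(Xb @ W.T, 0.0) @ v        # depth-two ReLU network

lr, epochs, batch = 0.05, 200, 50               # hyperparameters (assumptions)
for _ in range(epochs):
    for idx in np.array_split(rng.permutation(m), m // batch):
        Xb, yb = X[idx], y[idx]
        A = Xb @ W.T
        r = np.maximum(A, 0.0) @ v - yb         # squared-loss residual
        mask = (A > 0).astype(float)
        grad_W = (2.0 / len(idx)) * ((r[:, None] * mask * v[None, :]).T @ Xb)
        grad_v = (2.0 / len(idx)) * (np.maximum(A, 0.0).T @ r)
        W -= lr * grad_W
        v -= lr * grad_v

print("training accuracy:", np.mean(np.sign(forward(X, W, v)) == y))
```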